{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using evaluations with the SDK\n",
    "In this cookbook we show how to work with evaluations in Agenta programmatically, using either the SDK or the raw API.\n",
    "\n",
    "We will do the following:\n",
    "\n",
    "- Create a test set\n",
    "- Create and configure an evaluator\n",
    "- Run an evaluation\n",
    "- Retrieve the results of evaluations\n",
    "\n",
    "We assume that you have already created an LLM application with at least one variant in Agenta.\n",
    "\n",
    "\n",
    "## Architectural overview\n",
    "In this scenario, evaluations are executed on the Agenta backend. Specifically, Agenta invokes the LLM application for each row in the test set and subsequently processes the output using the designated evaluator. \n",
    "This operation is managed through Celery tasks. The interactions with the LLM application are asynchronous, batched, and include retry mechanisms. Additionally, the batching configuration can be adjusted to avoid exceeding the rate limits imposed by the LLM provider.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install -U agenta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This example assumes that an application has already been created through the user interface.\n",
    "# We use the default template single_prompt, which has the prompt \"Determine the capital of {country}\".\n",
    "from agenta.client.backend.client import AgentaApi\n",
    "\n",
    "# You can create the API key under the settings page. If you are using the OSS version, keep this as an empty string.\n",
    "api_key = \"EUqJGOUu.xxxx\"\n",
    "\n",
    "# Host of the Agenta instance\n",
    "host = \"https://cloud.agenta.ai\"\n",
    "\n",
    "# Initialize the client\n",
    "client = AgentaApi(base_url=host + \"/api\", api_key=api_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the applications to find the application ID\n",
    "client.apps.list_apps()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can also find the application ID in the URL. For example, in the URL https://cloud.agenta.ai/apps/666dde95962bbaffdb0072b5/playground?variant=app.default, the application ID is `666dde95962bbaffdb0072b5`.\n",
    "# Replace the value below with the ID of your own application.\n",
    "app_id = \"667d8cfad1812781f7e375d9\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agenta.client.backend.types.new_testset import NewTestset\n",
    "\n",
    "# Each row holds the app input (country) and the expected answer (capital).\n",
    "# The capital in the second row is wrong; the update below corrects it.\n",
    "csvdata = [\n",
    "    {\"country\": \"france\", \"capital\": \"Paris\"},\n",
    "    {\"country\": \"Germany\", \"capital\": \"paris\"},\n",
    "]\n",
    "\n",
    "response = client.testsets.create_testset(app_id=app_id, request=NewTestset(name=\"test set\", csvdata=csvdata))\n",
    "test_set_id = response.id\n",
    "\n",
    "# Update the test set in place, using the ID returned when it was created\n",
    "csvdata = [\n",
    "    {\"country\": \"france\", \"capital\": \"Paris\"},\n",
    "    {\"country\": \"Germany\", \"capital\": \"Berlin\"},\n",
    "]\n",
    "\n",
    "client.testsets.update_testset(testset_id=test_set_id, request=NewTestset(name=\"test set\", csvdata=csvdata))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create evaluators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an evaluator that performs an exact match comparison on the 'capital' column.\n",
    "# The evaluator keys and their configurations are listed in https://github.com/Agenta-AI/agenta/blob/main/agenta-backend/agenta_backend/resources/evaluators/evaluators.py\n",
    "response = client.evaluators.create_new_evaluator_config(\n",
    "    app_id=app_id,\n",
    "    name=\"capital_evaluator\",\n",
    "    evaluator_key=\"auto_exact_match\",\n",
    "    settings_values={\"correct_answer_key\": \"capital\"},\n",
    ")\n",
    "exact_match_eval_id = response.id\n",
    "\n",
    "# Create a custom code evaluator that returns 1.0 when the output starts with an uppercase letter and 0.0 otherwise\n",
    "code_snippet = \"\"\"\n",
    "from typing import Dict\n",
    "\n",
    "def evaluate(\n",
    "    app_params: Dict[str, str],\n",
    "    inputs: Dict[str, str],\n",
    "    output: str,  # output of the LLM app\n",
    "    datapoint: Dict[str, str]  # contains the test set row\n",
    ") -> float:\n",
    "    if output and output[0].isupper():\n",
    "        return 1.0\n",
    "    else:\n",
    "        return 0.0\n",
    "\"\"\"\n",
    "\n",
    "response = client.evaluators.create_new_evaluator_config(\n",
    "    app_id=app_id,\n",
    "    name=\"capital_letter_evaluator\",\n",
    "    evaluator_key=\"auto_custom_code_run\",\n",
    "    settings_values={\"code\": code_snippet},\n",
    ")\n",
    "letter_match_eval_id = response.id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get list of all evaluators\n",
    "client.evaluators.get_evaluator_configs(app_id=app_id)"
   ]
  },
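  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The custom code evaluator above is plain Python, so it can be sanity-checked locally before it is used in an evaluation. The cell below is a minimal sketch: it repeats the same snippet (under the illustrative name `local_snippet`), loads it with `exec`, and calls `evaluate` by hand. The sample rows and outputs are made up for illustration, and the way Agenta executes the code on the backend may differ from this local call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same evaluator code as in the cell above, repeated so that this cell runs on its own\n",
    "local_snippet = \"\"\"\n",
    "from typing import Dict\n",
    "\n",
    "def evaluate(\n",
    "    app_params: Dict[str, str],\n",
    "    inputs: Dict[str, str],\n",
    "    output: str,  # output of the LLM app\n",
    "    datapoint: Dict[str, str]  # contains the test set row\n",
    ") -> float:\n",
    "    if output and output[0].isupper():\n",
    "        return 1.0\n",
    "    else:\n",
    "        return 0.0\n",
    "\"\"\"\n",
    "\n",
    "# Load the snippet into an isolated namespace and grab the function it defines\n",
    "namespace = {}\n",
    "exec(local_snippet, namespace)\n",
    "evaluate_locally = namespace[\"evaluate\"]\n",
    "\n",
    "# Hypothetical sample data, not real app output\n",
    "row = {\"country\": \"france\", \"capital\": \"Paris\"}\n",
    "score_upper = evaluate_locally({}, {\"country\": \"france\"}, \"Paris\", row)\n",
    "score_lower = evaluate_locally({}, {\"country\": \"france\"}, \"paris\", row)\n",
    "score_empty = evaluate_locally({}, {\"country\": \"france\"}, \"\", row)\n",
    "print(score_upper, score_lower, score_empty)"
   ]
  },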
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run an evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the variants of the application and use the first one for the evaluation\n",
    "response = client.apps.list_app_variants(app_id=app_id)\n",
    "print(response)\n",
    "myvariant_id = response[0].variant_id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run an evaluation\n",
    "from agenta.client.backend.types.llm_run_rate_limit import LlmRunRateLimit\n",
    "\n",
    "response = client.evaluations.create_evaluation(\n",
    "    app_id=app_id,\n",
    "    variant_ids=[myvariant_id],\n",
    "    testset_id=test_set_id,\n",
    "    evaluators_configs=[exact_match_eval_id, letter_match_eval_id],\n",
    "    rate_limit=LlmRunRateLimit(\n",
    "        batch_size=10,  # number of rows to call in parallel\n",
    "        max_retries=3,  # maximum number of times to retry a failed LLM call\n",
    "        retry_delay=2,  # delay before retrying a failed LLM call\n",
    "        delay_between_batches=5,  # delay between batches\n",
    "    ),\n",
    ")\n",
    "print(response)"
   ]
  },
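  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build intuition for the rate limit settings, the cell below sketches how batching with retries could behave. It is not Agenta's implementation, which runs on the backend through Celery tasks and asynchronous calls. The helper `run_in_batches`, the stand-in `fake_llm_app`, and the reading of `max_retries` as retries after the first attempt are all assumptions made for illustration. Delays are set to 0 so that the cell runs instantly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from concurrent.futures import ThreadPoolExecutor\n",
    "\n",
    "\n",
    "def run_in_batches(rows, call, batch_size, max_retries, retry_delay, delay_between_batches):\n",
    "    \"\"\"Call `call(row)` for every row, batch by batch, retrying failed calls.\"\"\"\n",
    "\n",
    "    def call_with_retries(row):\n",
    "        # One initial attempt plus up to `max_retries` retries (an assumption)\n",
    "        for attempt in range(max_retries + 1):\n",
    "            try:\n",
    "                return call(row)\n",
    "            except Exception:\n",
    "                if attempt == max_retries:\n",
    "                    return None  # give up on this row\n",
    "                time.sleep(retry_delay)\n",
    "\n",
    "    results = []\n",
    "    for start in range(0, len(rows), batch_size):\n",
    "        batch = rows[start:start + batch_size]\n",
    "        # Rows within a batch are processed in parallel\n",
    "        with ThreadPoolExecutor(max_workers=len(batch)) as pool:\n",
    "            results.extend(pool.map(call_with_retries, batch))\n",
    "        # Pause between batches to stay under the provider's rate limit\n",
    "        if start + batch_size < len(rows):\n",
    "            time.sleep(delay_between_batches)\n",
    "    return results\n",
    "\n",
    "\n",
    "# Hypothetical stand-in for the LLM app: the first call for \"Germany\" fails\n",
    "attempts = {}\n",
    "\n",
    "def fake_llm_app(row):\n",
    "    attempts[row] = attempts.get(row, 0) + 1\n",
    "    if row == \"Germany\" and attempts[row] == 1:\n",
    "        raise RuntimeError(\"simulated rate limit error\")\n",
    "    return f\"capital of {row}\"\n",
    "\n",
    "\n",
    "demo_results = run_in_batches(\n",
    "    [\"France\", \"Germany\", \"Spain\"],\n",
    "    fake_llm_app,\n",
    "    batch_size=2,\n",
    "    max_retries=3,\n",
    "    retry_delay=0,\n",
    "    delay_between_batches=0,\n",
    ")\n",
    "print(demo_results)"
   ]
  },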
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the status. Replace the ID below with the ID of the evaluation created above (printed in the previous cell).\n",
    "client.evaluations.fetch_evaluation_status('667d98fbd1812781f7e3761a')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the overall results. Replace the ID below with the ID of your own evaluation.\n",
    "response = client.evaluations.fetch_evaluation_results('667d98fbd1812781f7e3761a')\n",
    "\n",
    "# Pair each evaluator's name with its aggregated result\n",
    "results = [(evaluator[\"evaluator_config\"][\"name\"], evaluator[\"result\"]) for evaluator in response[\"results\"]]\n",
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the detailed, per-row results. Replace the ID below with the ID of your own evaluation.\n",
    "client.evaluations.fetch_evaluation_scenarios(evaluations_ids='667d98fbd1812781f7e3761a')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
